library(cleaningtools)
library(dplyr)
my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choices02 - R framework with IMPACT - session 2
This section continues with the cleaning step.
Recap
more_logs <- my_raw_dataset %>%
check_duplicate(uuid_column = "X_uuid") %>%
check_soft_duplicates(uuid_column = "X_uuid", kobo_survey = my_kobo_survey, sm_separator = ".") %>%
check_outliers(uuid_column = "X_uuid") %>%
check_value(uuid_column = "X_uuid")
more_logs$checked_dataset <- more_logs$checked_dataset %>%
add_duration(uuid_column = "X_uuid", start_column = "X.U.FEFF.start", end_column = "end")
more_logs <- more_logs %>%
check_duration(column_to_check = "duration", uuid_column = "X_uuid")
other_columns_to_check <- my_kobo_survey %>%
filter(type == "text") %>%
filter(name %in% names(my_raw_dataset)) %>%
pull(name)
more_logs <- more_logs %>%
check_others(uuid_column = "X_uuid", columns_to_check = other_columns_to_check)
logical_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")
more_logs <- more_logs %>%
check_logical_with_list(uuid_column = "X_uuid",
list_of_check = logical_check_list,
check_id_column = "check_id",
check_to_perform_column = "check_to_perform",
columns_to_clean_column = "columns_to_clean",
description_column = "description")Cleaning - Creating a cleaning log
create_combined_log
create_combined_log will combined all the logs from a list into one. It will also add 2 columns that will be used for the cleaning.
names(more_logs)[1] "checked_dataset" "duplicate_log" "soft_duplicate_log"
[4] "potential_outliers" "flaged_value" "duration_log"
[7] "other_log" "logical_all"
my_combined_log <- create_combined_log(more_logs)List of element to combine- checked_dataset, duplicate_log, soft_duplicate_log, potential_outliers, flaged_value, duration_log, other_log, logical_all
typeof(my_combined_log)[1] "list"
names(my_combined_log)[1] "checked_dataset" "cleaning_log"
my_combined_log$cleaning_log %>%
head()| uuid | old_value | question | issue | check_id | check_binding | change_type | new_value |
|---|---|---|---|---|---|---|---|
| b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 | 86 | age_respondent_r | outlier (normal distribution) | NA | age_respondent_r / b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 | NA | NA |
| 956b5ed0-5a62-41b7-aec3-af93fbc5b494 | 84 | age_respondent_r | outlier (normal distribution) | NA | age_respondent_r / 956b5ed0-5a62-41b7-aec3-af93fbc5b494 | NA | NA |
| 97ad6294-30c6-454e-a0b3-42126415b767 | 18 | age_respondent_r | outlier (log distribution) | NA | age_respondent_r / 97ad6294-30c6-454e-a0b3-42126415b767 | NA | NA |
| e005e719-57c4-44a3-ac2f-5d6d1ff68831 | 18 | age_respondent_r | outlier (log distribution) | NA | age_respondent_r / e005e719-57c4-44a3-ac2f-5d6d1ff68831 | NA | NA |
| c9aaa542-118f-4e42-93de-fb0916572541 | 19 | num_hh_member | outlier (normal distribution) | NA | num_hh_member / c9aaa542-118f-4e42-93de-fb0916572541 | NA | NA |
| 48e8896b-d1be-4600-8839-2d8b994ebcfb | 19 | num_hh_member | outlier (normal distribution) | NA | num_hh_member / 48e8896b-d1be-4600-8839-2d8b994ebcfb | NA | NA |
The cleaning log contains all the columns from all the logs from more_logs with in addition:
- check_binding is filled for all rows.
- change_type (empty)
- new_value (empty)
add_info_to_cleaning_log
If more information from the dataset should be added, the function add_info_to_cleaning_log can help.
my_combined_log <- my_combined_log %>%
add_info_to_cleaning_log(dataset_uuid_column = "X_uuid",
information_to_add = "enumerator_num")
my_combined_log$cleaning_log %>%
head()| uuid | old_value | question | issue | check_id | check_binding | change_type | new_value | enumerator_num |
|---|---|---|---|---|---|---|---|---|
| 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | لا اعلم | water_supply_other_neighbourhoods_why | recode other | NA | water_supply_other_neighbourhoods_why / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | NA | NA | 12 |
| 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | لا ارى جدوى من ذلك | prefer_not_engage_other | recode other | NA | prefer_not_engage_other / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | NA | NA | 12 |
| 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | السلطات ليست مهتمة بالخدمات | trust_water_office_why_not | recode other | NA | trust_water_office_why_not / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | NA | NA | 12 |
| 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | 247.20 | duration | Duration is lower or higher than the thresholds | NA | duration / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | NA | NA | 12 |
| 03183d24-0275-43fe-8976-d076f29de590 | عدم توفير خدمه المياه في المنطقه وضعف في تزويد خدمه وعدد ساعات اقل تجهيز للبيوت | water_supply_other_neighbourhoods_why | recode other | NA | water_supply_other_neighbourhoods_why / 03183d24-0275-43fe-8976-d076f29de590 | NA | NA | 2 |
| 03183d24-0275-43fe-8976-d076f29de590 | 10 | pay_water_charges_amount | outlier (log distribution) | NA | pay_water_charges_amount / 03183d24-0275-43fe-8976-d076f29de590 | NA | NA | 2 |
create_xlsx_cleaning_log
create_xlsx_cleaning_log(my_combined_log,
sm_dropdown_type = "logical",
output_path = "outputs/01 - example - cleaning-log-no-kobo.xlsx")create_xlsx_cleaning_log will write an excel file with:
- checked_dataset tab: the checked dataset, with additional columns if any.
- cleaning_log tab: the combined log with the change_type column with a data validation rules.
- readme tab: change_type values definition.
There are 4 actions possible:
- change_response: Change the response to new.value
- blank_response: Remove and NA the response
- remove_survey: Delete the survey
- no_action: No action to take.
This log will have to be filled in with actions to take and new value if needed.
create_xlsx_cleaning_log(my_combined_log,
kobo_survey = my_kobo_survey,
kobo_choices = my_kobo_choice,
use_dropdown = T,
sm_dropdown_type = "logical",
output_path = "outputs/02 - example - cleaning-log-with-kobo.xlsx")If the KOBO information are provided and the use_dropdown argument is set to TRUE, new_value will have a data validation rule based on the KOBO options.
Select multiple dummy columns (TRUE/FALSE or 1/0) are flagged and used later for the cleaning, not the parent column.
Cleaning - Creating a clean dataset
The cleaning has to be filled before moving forward. The above steps are to create the cleaning log, not to fill it or clean the dataset. Filling the cleaning log is not an automatic step for this process. The decision to change a value has to be recorded.
review_cleaning_log
review_cleaning_log will review the filled cleaning log.
my_filled_log <- readxl::read_excel("inputs/02 - example - cleaning-log-with-kobo - filled.xlsx", sheet = 2)
check_log_results <- review_cleaning_log(raw_dataset = my_raw_dataset,
raw_data_uuid_column = "X_uuid",
cleaning_log = my_filled_log,
cleaning_log_uuid_column = "uuid",
cleaning_log_question_column = "question",
cleaning_log_new_value_column = "new_value",
cleaning_log_change_type_column = "change_type",
change_response_value = "change_response")
check_log_results[1] "no issues in cleaning log found"
create_clean_data
To create the clean dataset, create_clean_data will use the raw dataset and the filled cleaning log.
my_clean_data <- create_clean_data(raw_dataset = my_raw_dataset,
raw_data_uuid_column = "X_uuid",
cleaning_log = my_filled_log,
cleaning_log_uuid_column = "uuid",
cleaning_log_question_column = "question",
cleaning_log_new_value_column = "new_value",
cleaning_log_change_type_column = "change_type")[1] "water_supply_other_neighbourhoods_why"
[1] "trust_water_office_why_not"
[1] "pay_water_charges_amount"
[1] "connection_fees_amount"
[1] "connection_fees_amount"
[1] "primary_livelihood.employment"
[1] "primary_livelihood.employment"
[1] "primary_livelihood.employment"
[1] "tank_emptied"
[1] "access_water_enough"
recreate_parent_column
In the cleaning log, some select multiple are changed, but only the dummy.
my_filled_log %>%
filter(question == "primary_livelihood.employment",
change_type == "change_response") %>%
select(uuid, question, old_value, new_value)| uuid | question | old_value | new_value |
|---|---|---|---|
| eb3b9935-fa0b-4d54-8058-3b629f1421ad | primary_livelihood.employment | TRUE | FALSE |
| eec1f630-15d5-475e-a344-32bba74b32ea | primary_livelihood.employment | TRUE | FALSE |
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | primary_livelihood.employment | TRUE | FALSE |
The parent column did not change, in the following example employment still appears in the parent column.
my_clean_data %>%
filter(X_uuid %in% c("eb3b9935-fa0b-4d54-8058-3b629f1421ad", "eec1f630-15d5-475e-a344-32bba74b32ea", "f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb")) %>%
select(X_uuid,primary_livelihood, primary_livelihood.employment)| X_uuid | primary_livelihood | primary_livelihood.employment |
|---|---|---|
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | employment | FALSE |
| eb3b9935-fa0b-4d54-8058-3b629f1421ad | employment ngo | FALSE |
| eec1f630-15d5-475e-a344-32bba74b32ea | employment | FALSE |
recreate_parent_column will recode the parent columns based on the dummies.
my_clean_data2 <- recreate_parent_column(dataset = my_clean_data,
uuid_column = "X_uuid",
kobo_survey = my_kobo_survey,
kobo_choices = my_kobo_choice,
sm_separator = ".",
cleaning_log_to_append = my_filled_log)The parent are corrected, employment does not appear in the parent column.
my_clean_data2$data_with_fix_concat %>%
filter(X_uuid %in% c("eb3b9935-fa0b-4d54-8058-3b629f1421ad", "eec1f630-15d5-475e-a344-32bba74b32ea", "f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb")) %>%
select(X_uuid,primary_livelihood, primary_livelihood.employment)| X_uuid | primary_livelihood | primary_livelihood.employment |
|---|---|---|
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | NA | FALSE |
| eb3b9935-fa0b-4d54-8058-3b629f1421ad | ngo | FALSE |
| eec1f630-15d5-475e-a344-32bba74b32ea | NA | FALSE |
The changes were added to the cleaning log.
my_clean_data2$cleaning_log %>%
filter(question == "primary_livelihood",
uuid %in% c("eb3b9935-fa0b-4d54-8058-3b629f1421ad", "eec1f630-15d5-475e-a344-32bba74b32ea", "f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb")) %>%
select(uuid, question, old_value, new_value)| uuid | question | old_value | new_value |
|---|---|---|---|
| eb3b9935-fa0b-4d54-8058-3b629f1421ad | primary_livelihood | employment ngo | ngo |
| eec1f630-15d5-475e-a344-32bba74b32ea | primary_livelihood | employment | NA |
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | primary_livelihood | employment | NA |
review_others
In the cleaning log, some opentext values are changed to blank. Some open text questions are linked some skip logic, i.e. what is X? Other, please specify. In some cases, values some values should be changed.
In the example below, the value for water_supply_other_neighbourhoods_why for the uuid 019bc718-c06a-46b8-bba8-c84f6c6efbd5 was changed to NA.
my_filled_log %>%
filter(question == "water_supply_other_neighbourhoods_why",
change_type == "blank_response")| uuid | old_value | question | issue | check_id | check_binding | change_type | new_value | enumerator_num |
|---|---|---|---|---|---|---|---|---|
| 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | لا اعلم | water_supply_other_neighbourhoods_why | recode other | NA | water_supply_other_neighbourhoods_why / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 | blank_response | NA | 12 |
The kobo show a skip logic based on water_supply_other_neighbourhoods.
my_kobo_survey %>%
filter(name == "water_supply_other_neighbourhoods_why") %>%
select(type, name, relevant)| type | name | relevant |
|---|---|---|
| text | water_supply_other_neighbourhoods_why | selected(\({water_supply_other_neighbourhoods},'somewhat_worse') or selected(\){water_supply_other_neighbourhoods},‘much_worse’) |
my_clean_data %>%
filter(X_uuid == "019bc718-c06a-46b8-bba8-c84f6c6efbd5") %>%
select(water_supply_other_neighbourhoods, water_supply_other_neighbourhoods_why )| water_supply_other_neighbourhoods | water_supply_other_neighbourhoods_why |
|---|---|
| somewhat_worse | NA |
Should the value of water_supply_other_neighbourhoods be changed? It depends on the question and skip logic but it important to flag those so a decision can be taken.
review_other_log <- review_others(dataset = my_clean_data2$data_with_fix_concat,
uuid_column = "X_uuid",
kobo_survey = my_kobo_survey,
columns_not_to_check = "consent_telephone_number")Warning in create_logic_for_other(kobo_survey = kobo_survey,
compare_with_dataset = TRUE, : The following parent names: well_quality,
spring_quality, rainwater_quality, surface_quality, why_not_connected were not
found in the dataset. The function is ignoring them.
review_cleaning
my_deletion_log <- my_clean_data2$cleaning_log %>%
filter(change_type == "remove_survey")
my_filled_log_no_deletion <- my_clean_data2$cleaning_log %>%
filter(change_type != "remove_survey") %>%
filter(!uuid %in% my_deletion_log$uuid)
review_of_cleaning <- review_cleaning(raw_dataset = my_raw_dataset,
raw_dataset_uuid_column = "X_uuid",
clean_dataset = my_clean_data2$data_with_fix_concat,
clean_dataset_uuid_column = "X_uuid",
cleaning_log = my_filled_log_no_deletion,
cleaning_log_uuid_column = "uuid",
cleaning_log_question_column = "question",
cleaning_log_new_value_column = "new_value",
cleaning_log_change_type_column = "change_type",
cleaning_log_old_value_column = "old_value",
deletion_log = my_deletion_log,
deletion_log_uuid_column = "uuid"
)Exercises
Exercise 1
- Export the cleaning log you have created previously. The previous log is loaded below.
previous_exercise_log <- readRDS("inputs/03 - exercise - previous_log.RDS")
previous_exercise_log %>% names()[1] "checked_dataset" "percentage_missing_log" "potential_PII"
[4] "logical_all"
Exercise 2
- Create the clean data from the raw dataset and the filled cleaning.
exercise_filled_log <- readxl::read_excel("inputs/04 - exercise - cleaning_log - filled.xlsx", sheet = "cleaning_log")Exercise 3
- Review the cleaning below, if there is someone else doing the exercise, you can try to review someone’s cleaning.
exercise3_clean_dataset <- readxl::read_excel("inputs/05 - exercise - clean dataset for review.xlsx")
exercise3_cleaning_log <- readxl::read_excel("inputs/05 - exercise - clean dataset for review.xlsx", sheet = 2)Review templates
There are some project templates that can be used to review the cleaning. More information on this repository.
Composition - adding indicators
The framework is built around 4 steps: cleaning, composition, analysis, outputs
- Cleaning: any manipulation to go from the raw data to the clean data
- Composition: any manipulation before the analysis e.g. adding indicators, adding information from loop or main, aok aggregation, etc.
- Analysis: any manipulation regarding only the analysis
- Outputs: any manipulation to format the outputs.
The following section will present some introduction about the composition.
library(addindicators)
library(dplyr)
my_data <- addindicators::addindicators_MSNA_template_dataadd_fcs
my_data_with_fcs <- my_data %>% add_fcs(
cutoffs = "normal",
fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
fsl_fcs_legumes = "fs_fcs_beans_nuts",
fsl_fcs_veg = "fs_fcs_vegetables_leaves",
fsl_fcs_fruit = "fs_fcs_fruit",
fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
fsl_fcs_dairy = "fs_fcs_dairy",
fsl_fcs_sugar = "fs_fcs_sugar",
fsl_fcs_oil = "fs_fcs_oil_fat_butter"
)
my_data_with_fcs[, tail(names(my_data_with_fcs), 10)] %>%
head()| fcs_weight_cereal1 | fcs_weight_legume2 | fcs_weight_dairy3 | fcs_weight_meat4 | fcs_weight_veg5 | fcs_weight_fruit6 | fcs_weight_oil7 | fcs_weight_sugar8 | fsl_fcs_score | fsl_fcs_cat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8 | 28 | 5 | 6 | 1.0 | 0.5 | 48.5 | Acceptable |
| 4 | 6 | 12 | 16 | 7 | 2 | 2.5 | 1.0 | 50.5 | Acceptable |
| 8 | 0 | 0 | 24 | 7 | 6 | 1.0 | 0.5 | 46.5 | Acceptable |
| 14 | 3 | 4 | 12 | 4 | 2 | 1.0 | 2.5 | 42.5 | Acceptable |
| 6 | 6 | 4 | 20 | 6 | 7 | 3.5 | 3.0 | 55.5 | Acceptable |
| 0 | 6 | 16 | 4 | 7 | 1 | 0.0 | 3.5 | 37.5 | Acceptable |
You can learn more about food security indicators here.
add_hhs
my_data_with_indicators <- my_data %>%
add_fcs(
cutoffs = "normal",
fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
fsl_fcs_legumes = "fs_fcs_beans_nuts",
fsl_fcs_veg = "fs_fcs_vegetables_leaves",
fsl_fcs_fruit = "fs_fcs_fruit",
fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
fsl_fcs_dairy = "fs_fcs_dairy",
fsl_fcs_sugar = "fs_fcs_sugar",
fsl_fcs_oil = "fs_fcs_oil_fat_butter"
) %>%
add_hhs(
)
my_data_with_indicators[, tail(names(my_data_with_indicators), 14)] %>%
head()| fsl_fcs_score | fsl_fcs_cat | fs_hhs_nofood_yn_recoded | fs_hhs_nofood_freq_recoded | fs_hhs_sleephungry_yn_recoded | fs_hhs_sleephungry_freq_recoded | fs_hhs_daynoteating_yn_recoded | fs_hhs_daynoteating_freq_recoded | hhs_comp1 | hhs_comp2 | hhs_comp3 | hhs_score | hhs_cat_ipc | hhs_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48.5 | Acceptable | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | None | No or Little |
| 50.5 | Acceptable | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | Little | No or Little |
| 46.5 | Acceptable | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 2 | 0 | 2 | Moderate | Moderate |
| 42.5 | Acceptable | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | None | No or Little |
| 55.5 | Acceptable | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | None | No or Little |
| 37.5 | Acceptable | 1 | 1 | 0 | 2 | 1 | 2 | 1 | 0 | 2 | 3 | Moderate | Moderate |
Composition - reviewing indicators
Reviewing indicators will compare 2 indicators together and present the differences. It will not check how the indicator was created nor check for inconsistencies. That mean, to review an indicator, it is necessary to create one and compare them. The functions review_one_variable and review_variables will focus on the latter.
review_variables
First, a new dataset can be created for the review.
review_df <- addindicators_MSNA_template_data %>%
add_fcs(
cutoffs = "normal",
fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
fsl_fcs_legumes = "fs_fcs_beans_nuts",
fsl_fcs_veg = "fs_fcs_vegetables_leaves",
fsl_fcs_fruit = "fs_fcs_fruit",
fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
fsl_fcs_dairy = "fs_fcs_dairy",
fsl_fcs_sugar = "fs_fcs_sugar",
fsl_fcs_oil = "fs_fcs_oil_fat_butter"
) %>%
select(uuid, fsl_fcs_score, fsl_fcs_cat)Then the dataset to be reviewed and the new dataset can be binded together.
binded_df <- my_data_with_indicators %>%
full_join(review_df, by = "uuid")I would advice to use a full_join rather than a left/right_join. That way if any computation has missing value they will be spotted.
With the join_* if the names are the same .x and .y will added to the names.
review_one_var <- review_variables(binded_df,
columns_to_review = "fsl_fcs_cat.x",
columns_to_compare_with = "fsl_fcs_cat.y")
review_one_var %>%
names()[1] "dataset" "review_table"
It is a list with the dataset and a review table.
review_one_var$review_table %>%
head()| uuid | variable | review_check | review_comment |
|---|---|---|---|
| eaf540cd-32bd-41474b-b4beb5-d62fc987e45a | fsl_fcs_cat.x | TRUE | Same results |
| 89e706c3-53d8-4a4049-898586-4926085db71e | fsl_fcs_cat.x | TRUE | Same results |
| afd921c6-e54a-4c4740-919c93-87f59bd0e63a | fsl_fcs_cat.x | TRUE | Same results |
| d8b05f39-ba85-494c4d-808c84-9dc57823a4f1 | fsl_fcs_cat.x | TRUE | Same results |
| d6b42f9e-c209-4c4541-808a81-86bea53df142 | fsl_fcs_cat.x | TRUE | Same results |
| f1b9ec67-20db-47404d-a3ada0-1a37e5c49d02 | fsl_fcs_cat.x | TRUE | Same results |
The review table can be summarised to have a quicker overview.
review_one_var$review_table %>%
group_by(review_check, review_comment) %>%
tally()| review_check | review_comment | n |
|---|---|---|
| TRUE | Same results | 100 |
To see how differences are shown, some noise is introduced to the dataset.
jittered_df <- binded_df
set.seed(123)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.x"] <- sample(unique(jittered_df$fsl_fcs_cat.y), 5, T)
set.seed(124)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.y"] <- sample(unique(jittered_df$fsl_fcs_cat.y), 5, T)
set.seed(125)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.x"] <- NA
set.seed(1236)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.y"] <- NA
set.seed(1237)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_score.x"] <- sample(unique(jittered_df$fsl_fcs_score.x), 5, T)review_one_variable_jittered <- review_variables(jittered_df,
columns_to_review = "fsl_fcs_cat.x",
columns_to_compare_with = "fsl_fcs_cat.y")
review_one_variable_jittered$review_table %>%
group_by(review_check, review_comment) %>%
tally()| review_check | review_comment | n |
|---|---|---|
| FALSE | Different results | 9 |
| FALSE | Missing in fsl_fcs_cat.x | 5 |
| FALSE | Missing in fsl_fcs_cat.y | 5 |
| TRUE | Same results | 81 |
The dataset has new columns to help filtering for further investigation.
review_one_variable_jittered$dataset[, tail(names(review_one_variable_jittered$dataset), 5)] %>%
head()| hhs_cat | fsl_fcs_score.y | fsl_fcs_cat.y | review_check_fsl_fcs_cat.x | review_comment_fsl_fcs_cat.x |
|---|---|---|---|---|
| No or Little | 48.5 | Acceptable | TRUE | Same results |
| No or Little | 50.5 | Acceptable | TRUE | Same results |
| Moderate | 46.5 | Acceptable | TRUE | Same results |
| No or Little | 42.5 | Acceptable | TRUE | Same results |
| No or Little | 55.5 | Acceptable | TRUE | Same results |
| Moderate | 37.5 | Poor | FALSE | Different results |
review_one_variable_jittered$dataset %>%
filter(!review_check_fsl_fcs_cat.x) %>%
select(uuid, fsl_fcs_cat.x, fsl_fcs_cat.y, review_check_fsl_fcs_cat.x, review_comment_fsl_fcs_cat.x)| uuid | fsl_fcs_cat.x | fsl_fcs_cat.y | review_check_fsl_fcs_cat.x | review_comment_fsl_fcs_cat.x |
|---|---|---|---|---|
| f1b9ec67-20db-47404d-a3ada0-1a37e5c49d02 | Acceptable | Poor | FALSE | Different results |
| 42dc8573-e2d0-43484b-aaada2-c37ef865d041 | Borderline | Acceptable | FALSE | Different results |
| fcd69a08-498f-4c4b47-989799-743cbe5fd960 | NA | Acceptable | FALSE | Missing in fsl_fcs_cat.x |
| d36b8cfa-bf52-48434f-8b8d88-ef620d73c941 | Poor | Acceptable | FALSE | Different results |
| 72095b68-8c51-484245-929b97-360eda2c4b81 | NA | Acceptable | FALSE | Missing in fsl_fcs_cat.x |
| 675eb3d0-62ba-4e4b42-a3a4af-08f14dce6592 | Poor | Acceptable | FALSE | Different results |
| 952efab8-4a32-4a4743-bcb4b8-d9a534fb8627 | Poor | Acceptable | FALSE | Different results |
| 01648f7a-8521-4d4347-afa3a7-7fd83ea05916 | Acceptable | Borderline | FALSE | Different results |
| 10db47e8-a721-49454a-888385-59068d4a3ef2 | Poor | Borderline | FALSE | Different results |
| 46c81eb3-7243-414f40-919e98-270439d8fbc5 | NA | Acceptable | FALSE | Missing in fsl_fcs_cat.x |
| 765d2341-5df2-43484a-8e888b-362fa085de17 | Acceptable | NA | FALSE | Missing in fsl_fcs_cat.y |
| 0dea8527-2ab9-4e4844-88868e-8379e514ca2b | NA | Acceptable | FALSE | Missing in fsl_fcs_cat.x |
| da823c6f-d215-43474c-b4b0b6-6ba519748c0f | NA | Acceptable | FALSE | Missing in fsl_fcs_cat.x |
| e76f4382-b3ea-4b404e-bebfb0-a6b2d7594c18 | Acceptable | Poor | FALSE | Different results |
| 42698e10-7e19-4d4744-8d8b8c-04d187cb62ae | Borderline | NA | FALSE | Missing in fsl_fcs_cat.y |
| bdc23a6e-1a35-49474b-949e9b-57bf01284ea9 | Acceptable | NA | FALSE | Missing in fsl_fcs_cat.y |
| 2bd1809c-a2c1-424b44-b7b0b4-9a53b24d0e67 | Borderline | Acceptable | FALSE | Different results |
| de416c95-845f-4d4f40-868f8c-e12b3946ad07 | Acceptable | NA | FALSE | Missing in fsl_fcs_cat.y |
| 31f4e76d-c64e-4b4144-bbb1bf-b05ca69d823e | Acceptable | NA | FALSE | Missing in fsl_fcs_cat.y |
If there are more than one variable to review, pair-wise vectors can be used.
my_review <- review_variables(jittered_df,
columns_to_review = c("fsl_fcs_cat.x", "fsl_fcs_score.x"),
columns_to_compare_with = c("fsl_fcs_cat.y", "fsl_fcs_score.y")
)my_review$review_table %>%
group_by(variable, review_check, review_comment) %>%
tally()| variable | review_check | review_comment | n |
|---|---|---|---|
| fsl_fcs_cat.x | FALSE | Different results | 9 |
| fsl_fcs_cat.x | FALSE | Missing in fsl_fcs_cat.x | 5 |
| fsl_fcs_cat.x | FALSE | Missing in fsl_fcs_cat.y | 5 |
| fsl_fcs_cat.x | TRUE | Same results | 81 |
| fsl_fcs_score.x | FALSE | Different results | 5 |
| fsl_fcs_score.x | TRUE | Same results | 95 |
my_review$dataset %>%
filter(!review_check_fsl_fcs_cat.x) %>%
select(uuid, fsl_fcs_cat.x, fsl_fcs_cat.y, review_comment_fsl_fcs_cat.x)| uuid | fsl_fcs_cat.x | fsl_fcs_cat.y | review_comment_fsl_fcs_cat.x |
|---|---|---|---|
| f1b9ec67-20db-47404d-a3ada0-1a37e5c49d02 | Acceptable | Poor | Different results |
| 42dc8573-e2d0-43484b-aaada2-c37ef865d041 | Borderline | Acceptable | Different results |
| fcd69a08-498f-4c4b47-989799-743cbe5fd960 | NA | Acceptable | Missing in fsl_fcs_cat.x |
| d36b8cfa-bf52-48434f-8b8d88-ef620d73c941 | Poor | Acceptable | Different results |
| 72095b68-8c51-484245-929b97-360eda2c4b81 | NA | Acceptable | Missing in fsl_fcs_cat.x |
| 675eb3d0-62ba-4e4b42-a3a4af-08f14dce6592 | Poor | Acceptable | Different results |
| 952efab8-4a32-4a4743-bcb4b8-d9a534fb8627 | Poor | Acceptable | Different results |
| 01648f7a-8521-4d4347-afa3a7-7fd83ea05916 | Acceptable | Borderline | Different results |
| 10db47e8-a721-49454a-888385-59068d4a3ef2 | Poor | Borderline | Different results |
| 46c81eb3-7243-414f40-919e98-270439d8fbc5 | NA | Acceptable | Missing in fsl_fcs_cat.x |
| 765d2341-5df2-43484a-8e888b-362fa085de17 | Acceptable | NA | Missing in fsl_fcs_cat.y |
| 0dea8527-2ab9-4e4844-88868e-8379e514ca2b | NA | Acceptable | Missing in fsl_fcs_cat.x |
| da823c6f-d215-43474c-b4b0b6-6ba519748c0f | NA | Acceptable | Missing in fsl_fcs_cat.x |
| e76f4382-b3ea-4b404e-bebfb0-a6b2d7594c18 | Acceptable | Poor | Different results |
| 42698e10-7e19-4d4744-8d8b8c-04d187cb62ae | Borderline | NA | Missing in fsl_fcs_cat.y |
| bdc23a6e-1a35-49474b-949e9b-57bf01284ea9 | Acceptable | NA | Missing in fsl_fcs_cat.y |
| 2bd1809c-a2c1-424b44-b7b0b4-9a53b24d0e67 | Borderline | Acceptable | Different results |
| de416c95-845f-4d4f40-868f8c-e12b3946ad07 | Acceptable | NA | Missing in fsl_fcs_cat.y |
| 31f4e76d-c64e-4b4144-bbb1bf-b05ca69d823e | Acceptable | NA | Missing in fsl_fcs_cat.y |
my_review$dataset %>%
filter(!review_check_fsl_fcs_score.x) %>%
select(uuid, fsl_fcs_score.x, fsl_fcs_score.y, review_comment_fsl_fcs_score.x)| uuid | fsl_fcs_score.x | fsl_fcs_score.y | review_comment_fsl_fcs_score.x |
|---|---|---|---|
| afd921c6-e54a-4c4740-919c93-87f59bd0e63a | 87.0 | 46.5 | Different results |
| c14529b6-06f7-4d4446-a1a7ac-b3648529ef7d | 61.0 | 78.5 | Different results |
| 4a78b1d6-ad91-4b4f45-aba5a0-bed8cf257106 | 41.5 | 30.0 | Different results |
| 26bce981-8217-464d45-979a96-8a69e5371b02 | 76.5 | 47.5 | Different results |
| ad31fe62-5a3c-484d49-b7b8bc-a79d8304be65 | 37.5 | 42.5 | Different results |
Exercises
Exercise 1
- Add the food consumption matrix score to the dataset. The food consumption matrix score is a food security indicator that uses the food consumption score, household hunger score and the reduced coping strategy index.
| name | label::english | type |
|---|---|---|
| rCSILessQlty | During the last 7 days, were there days (and, if so, how many) when your household had to rely on less preferred and less expensive food to cope with a lack of food or money to buy it? | integer |
| rCSIBorrow | During the last 7 days, were there days (and, if so, how many) when your household had to borrow food or rely on help from a relative or friend to cope with a lack of food or money to buy it? | integer |
| rCSIMealSize | During the last 7 days, were there days (and, if so, how many) when your household had to limit portion size of meals at meal times to cope with a lack of food or money to buy it? | integer |
| rCSIMealAdult | During the last 7 days, were there days (and, if so, how many) when your household had to restrict consumption by adults in order for small children to eat to cope with a lack of food or money to buy it? | integer |
| rCSIMealNb | During the last 7 days, were there days (and, if so, how many) when your household had to reduce number of meals eaten in a day to cope with a lack of food or money to buy it? | integer |
library(addindicators)
library(dplyr)exercise_data <- addindicators_MSNA_template_data %>%
add_fcs(
cutoffs = "normal",
fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
fsl_fcs_legumes = "fs_fcs_beans_nuts",
fsl_fcs_veg = "fs_fcs_vegetables_leaves",
fsl_fcs_fruit = "fs_fcs_fruit",
fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
fsl_fcs_dairy = "fs_fcs_dairy",
fsl_fcs_sugar = "fs_fcs_sugar",
fsl_fcs_oil = "fs_fcs_oil_fat_butter"
) %>%
add_hhs(
)Exercise 2
You receive a dataset, you need to review the following four indicators.
- Food Consumption Score: fcs_score, fcs_cat
- Household Hunger Score: hhs_score, hhs_cat
Don’t forget to write the review.
dataset_to_review <- read.csv("inputs/06 - exercise - dataset_to_review.csv")
dataset_without_indicators <- addindicators::addindicators_MSNA_template_dataimpactR4PHU
The functions about public health checks and indicators are more updated on the impactR4PHU. More information on this repository.